Coresets for k-Means and k-Median Clustering and their Applications
Abstract
In this paper, we show the existence of small coresets for the problems of computing k-median and k-means clustering for points in low dimension. In other words, we show that given a point set P in R^d, one can compute a weighted set S ⊆ P, of size O(k ε^{-d} log n), such that one can compute the k-median/means clustering on S instead of on P and get a (1 + ε)-approximation. As a result, we improve the fastest known algorithms for (1 + ε)-approximate k-means and k-median. Our algorithms have linear running time for fixed k and ε. In addition, we can maintain the (1 + ε)-approximate k-median or k-means clustering of a stream when points are only inserted, using polylogarithmic space and update time.
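To make the idea of clustering on a weighted subset concrete, here is a minimal Python sketch. It is not the construction from the paper: it uses a naive uniform grid, keeps one input point per occupied cell (so S ⊆ P), and weights it by the number of points in that cell. The names `kmeans_cost` and `grid_coreset` and the `cell` parameter are hypothetical, and this toy scheme carries no (1 + ε) guarantee; it only shows how the weighted cost on S stands in for the cost on P.

```python
import math
import random
from collections import defaultdict


def kmeans_cost(points, centers, weights=None):
    """Weighted k-means cost: sum over p of w(p) * min_c ||p - c||^2."""
    if weights is None:
        weights = [1.0] * len(points)
    total = 0.0
    for p, w in zip(points, weights):
        nearest = min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        total += w * nearest
    return total


def grid_coreset(points, cell):
    """Naive illustration only (not the paper's construction).

    Buckets points into a uniform grid with side length `cell`, keeps the
    first input point seen in each occupied cell, and gives it a weight
    equal to the number of points in that cell.
    """
    cells = defaultdict(list)
    for p in points:
        key = tuple(math.floor(x / cell) for x in p)
        cells[key].append(p)
    reps = [members[0] for members in cells.values()]
    weights = [float(len(members)) for members in cells.values()]
    return reps, weights


# Toy data: three Gaussian blobs in the plane (assumed for illustration).
random.seed(0)
P = [
    (random.gauss(cx, 0.5), random.gauss(cy, 0.5))
    for cx, cy in [(0, 0), (10, 0), (0, 10)]
    for _ in range(500)
]
S, w = grid_coreset(P, cell=0.25)
C = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]

# Compare the cost of the same centers on P and on the weighted subset S.
print("points:", len(P), "subset size:", len(S))
print("cost on P:", kmeans_cost(P, C))
print("cost on weighted S:", kmeans_cost(S, C, w))
```

Replacing the squared distance with the plain Euclidean distance in `kmeans_cost` would give the corresponding k-median cost.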
Similar resources
Distributed Balanced Clustering via Mapping Coresets
Large-scale clustering of data points in metric spaces is an important problem in mining big data sets. For many applications, we face explicit or implicit size constraints for each cluster which leads to the problem of clustering under capacity constraints or the “balanced clustering” problem. Although the balanced clustering problem has been widely studied, developing a theoretically sound di...
On the Sensitivity of Shape Fitting Problems
In this article, we study shape fitting problems, ε-coresets, and total sensitivity. We focus on the (j, k)-projective clustering problems, including k-median/k-means, k-line clustering, j-subspace approximation, and the integer (j, k)-projective clustering problem. We derive upper bounds on the total sensitivities for these problems, and obtain ε-coresets using these upper bounds. Using a dimension-...
Comparing k-means clusters on parallel Persian-English corpus
This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research on clustering English documents has been carried out. This raises the question: can those results be extended to other languages? Since the goal of document clustering is grouping of docum...
Algorithms for the Bregman k-Median problem
In this thesis, we study the k-median problem with respect to a dissimilarity measure D_φ from the family of Bregman divergences: given a finite set P of size n from R^d, our goal is to find a set C of size k such that the sum of errors cost(P, C) = ∑_{p∈P} min_{c∈C} D_φ(p, c) is minimized. This problem plays an important role in applications from many different areas of computer science, such as info...
Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering
We prove that the sum of the squared Euclidean distances from the n rows of an n×d matrix A to any compact set that is spanned by k vectors in R^d can be approximated up to a (1+ε)-factor, for an arbitrarily small ε > 0, using the O(k/ε)-rank approximation of A and a constant. This implies, for example, that the optimal k-means clustering of the rows of A is (1+ε)-approximated by an optimal k-means cl...